Immune Cell Infiltration Across Breast Cancer Subtypes: A Marker Gene Analysis of the METABRIC Cohort

Author

cyy36@cam.ac.uk

Published

February 23, 2026

Cover Page

MSt Healthcare Data Science

Authentication of practice

I confirm that I have fully read and understood the assignment brief for this module. Y

Details

Name: Chung Yan Surname: Yu
Submission Date
Word Count: whole assignment including codes
Word Count: main body excluding abstract, references and supplementary materials

Permission to share your assignment

I do give permission to share my assignment with future MSt participants.

University statement of originality

This assignment is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the Preface and specified in the text. It is not substantially the same as any that I have previously submitted for a degree or diploma or other qualification at the University of Cambridge or any other university or similar institution, or that is being concurrently submitted, except as declared in the Preface and specific in the text. I further state that no substantial part of my Portfolio has already been submitted, nor is being concurrently submitted for any such degree, diploma or other qualification at the University of Cambridge or any other university or similar institution except as declared in the Preface and specified in the text.

I confirm the statement of originality as above Y

Questions for reflection

Self-assessment is an important aspect of feedback literacy, which is, in turn, key to the development of expertise. As you proceed through the MSt Healthcare data science programme, we hope that you will make use of the following prompts to assess your own work on assignments. Specific assignment briefs will likely indicate which of these to address for which assessments, but, in general, we expect you to respond to one or two for each assignment on your course.

For each of the questions, do not spend too long answering – keep it brief. For each question you answer, limit yourself to no more than three items. And please remember, this is optional and developmental: these cover sheets are designed to create space for self-assessment and feedback dialogue, rather than additional assignment workload.

  1. Which aspects of this assignment are you most uncertain about and/or would most like to receive feedback on?

  2. What elements are you left pondering after this assignment that you would like to discuss further?

  3. How have you incorporated feedback from peers and tutors into this assignment?

  4. How, and to what extent, have you been able to incorporate feedback on previous course work into this assignment?

  5. Using the wording in the rubric, how would you describe the quality of the different aspects of your work?

Declaration of the use of generative AI

Which permitted use of generative AI are you acknowledging? Semantic search of literature and notes, outline creation, output formatting, informed feedback, code generation
Which generative AI tool did you use (name and version)? Claude Code v2.1.1x (Opus 4.6), M365 CoPilot
What did you use the tool for? Searching for relevant journal publications (via PubMed MCP), collating notes in my Obsidian vault, .bib file curation, code debugging+generation
How have you used or changed the generative AI’s output My collated notes always stay in point form so that I write out the paragraphs in my own words. Feedback provided by the GenAI models are weighed and assessed before being acted on.

Abstract

The ability for immune cells to infiltrate the tumour microenvironment (TME) plays a key role in the biological war machine taking on breast cancer.

Introduction

Tumour cells evading the detection and wrath of the immune system is a recognised hallmark of cancer1, and breast cancer is no exception2. The interplay between tumour cells, immune cells and other local factors form the tumour microenvironment (TME), and research in the TME of breast cancers has yielded clinical applications3. The amount of tumour infiltrating immune cells in breast cancers has shown prognostic value4, and this amount and prognostic impact varies across breast cancer’s molecular subtypes5. These subtypes are defined by gene expression6, and are popularised and commercialised as the PAM50 gene expression panel, which has its own prognostic significance7. The subtypes include luminal A, luminal B, HER2-enriched, basal-like and normal-like.

The METABRIC cohort from Curtis et al.8 is one of the larger datasets on breast cancer containing clinical, gene expression and mutation data of 2,509 patients. METABRIC’s analysis has already shown immune infiltration in one of their integrative clusters (IntClust4), showing deletions in T cell receptors and mRNA amplification for immune cell activity, but stopped short of fully characterising the immune cells involved. Jiang et al.9 characterised immune cell infiltration in the METABRIC cohort using CIBERSORT10, developing new immunotypes with prognostic significance. However, CIBERSORT is more of a black box model, a computational approach using deconvolution to estimate cell proportions from RNA mixtures. Therefore there exists a need to use a transparent approach to score immune cell infiltration in METABRIC’s breast tumour samples, and scrutinise any prognostic value from such scoring.

A panel of immune cell marker genes developed by Danaher et al.11 were used to score immune infiltration in the METABRIC cohort. The infiltration of different immune cell types as well as their prognostic value were investigated across the PAM50 subtypes. The goal is to characterise the immune landscape across PAM50 subtypes, identify which immune populations associate with survival, and test whether immune cell scores or multi-dimensional immune profiles improve prognostic value beyond subtype classification alone.

Methods

Click here to see how to pull fresh data straight from cBioPortal.
# To pull fresh data from cBioPortal instead of using the included processed files,
# delete the data/processed/ folder and re-render this document.
source("scripts/00-data-cleaning.R")
source("scripts/01-immune-cell-scoring.R")
source("scripts/02-task1-exploration.R")

This study makes use of the METABRIC dataset8 available on cBioPortal1214, including clinical, gene expression and mutation datasets of 2,509 patients. For gene expression, the log2 intensity was used in place of the normalised z-score to preserve the proportions when applying the immune cell scoring.

The principle of avoiding circularity guided two data selection decisions. First, PAM50 subtypes were chosen over IntClust as IntClust subtypes are defined by copy number aberrations that can directly influence immune marker gene expression, and IntClust4 is already characterised by immune infiltration. The clear lack of overlap between the PAM50 and Danaher’s immune cell gene panels also strengthened this choice. Second, the exclusion of Claudin-low patients. Claudin-low is another breast cancer subtype used alongside PAM50 subtypes on cBioPortal’s METABRIC dataset, and is characterised by immune infiltration15. Fougner et al.15 further demonstrated that Claudin-low is not an independent subtype in addition to the PAM50 subtypes, therefore patients of this category (n = 218) were removed.

Immune cell scoring is done by taking the mean log2 gene expression of the cell types’ marker genes, originally defined by Danaher et al.11 and tabulated in Table S1. Of the 60 marker genes from the Danaher panel, only 58 genes have expression data in the METABRIC dataset. TPSB2 is missing for mast cells and XCL2 for NK cells. The CD4 T cell score is derived by subtracting the CD8 T cell score from the overall T cell score. The Danaher scoring method was chosen for its simplicity and marker genes validated against flow cytometry, in contrast to deconvolution approaches previously applied to this cohort9.

Immune variation across PAM50 subtypes was assessed by Kruskal-Wallis test with Wilcoxon for pairwise comparisons. Prognostic associations were evaluated using Cox proportional hazards models, first univariately and then multivariately to adjust for PAM50 subtypes, along with Kaplan-Meier curves with log-rank tests for visualisation. To develop multi-dimensional immune profiles, clusters were fitted via k-means with silhouette scoring. All p-values from multiple testing were corrected via Benjamini-Hochberg false discovery rate (FDR) and reported as FDR throughout. Overall survival was used for prognosis as the disease-specific survival data used by Curtis et al.8 was not available on cBioPortal.

Cell type Markers Found Genes used Genes missing
B-cells 9 9 BLK, CD19, FCRL2, MS4A1, FAM30A, TNFRSF17, TCL1A, SPIB, PNOC
CD45 1 1 PTPRC
CD8 T cells 2 2 CD8A, CD8B
Cytotoxic cells 10 10 PRF1, GZMA, GZMB, NKG7, GZMH, KLRK1, KLRB1, KLRD1, CTSW, GNLY
DC 3 3 CCL13, CD209, HSD11B1
Exhausted CD8 4 4 LAG3, CD244, EOMES, PTGER4
Macrophages 4 4 CD68, CD84, CD163, MS4A4A
Mast cells 5 4 TPSAB1, CPA3, MS4A2, HDC TPSB2
NK CD56dim 4 4 KIR2DL3, KIR3DL1, KIR3DL2, IL21R
NK cells 3 2 XCL1, NCR1 XCL2
Neutrophils 7 7 FPR1, SIGLEC5, CSF3R, FCAR, FCGR3B, CEACAM3, S100A12
T-cells 6 6 CD6, CD3D, CD3E, SH2D1A, TRAT1, CD3G
Th1 cells 1 1 TBX21
Treg 1 1 FOXP3
NoteTable S1 — Immune marker gene coverage

Verifying the immune marker gene coverage on the Illumina HT-12 v3 microarray of METABRIC dataset.

Results

Cohort Summary and Mutation Landscape

Characteristic Overall
N = 1756 (100%)
Basal-like
N = 209 (12%)
HER2-Enriched
N = 224 (13%)
Luminal A
N = 700 (40%)
Luminal B
N = 475 (27%)
Normal-like
N = 148 (8.4%)
p-value
Age at diagnosis 62 (52, 71) 54 (44, 65) 59 (50, 69) 63 (53, 72) 67 (59, 74) 57 (46, 66) <0.001
ER status





<0.001
    Negative 339 (20%) 172 (84%) 115 (53%) 13 (1.9%) 11 (2.4%) 28 (19%)
    Positive 1,386 (80%) 33 (16%) 103 (47%) 677 (98%) 456 (98%) 117 (81%)
    Unknown 31 4 6 10 8 3
Tumour Grade





<0.001
    1 154 (9.2%) 2 (1.0%) 4 (1.9%) 118 (18%) 18 (3.9%) 12 (8.6%)
    2 714 (42%) 17 (8.3%) 54 (25%) 378 (57%) 186 (41%) 79 (57%)
    3 815 (48%) 187 (91%) 155 (73%) 172 (26%) 253 (55%) 48 (35%)
    Unknown 73 3 11 32 18 9
Cellularity





<0.001
    High 898 (52%) 124 (60%) 125 (59%) 326 (47%) 292 (63%) 31 (23%)
    Low 163 (9.5%) 28 (14%) 15 (7.0%) 53 (7.7%) 22 (4.7%) 45 (34%)
    Moderate 650 (38%) 54 (26%) 73 (34%) 312 (45%) 153 (33%) 58 (43%)
    Unknown 45 3 11 9 8 14
Follow-up (months) 116 (60, 184) 86 (33, 183) 97 (39, 172) 130 (82, 197) 104 (55, 164) 121 (64, 176) <0.001
Deceased 1,044 (59%) 115 (55%) 157 (70%) 378 (54%) 315 (66%) 79 (53%) <0.001
NoteTable 1 — Cohort characteristics by PAM50 subtype

Baseline characteristics of 1,756 METABRIC patients with expression data, stratified by PAM50 subtype. 218 claudin-low and 6 NC patients excluded. Continuous variables summarised as median (Q1, Q3); categorical as n (%). P-values from Kruskal-Wallis (continuous) or Pearson’s Chi-squared (categorical) tests.

NoteFigure 1 — Data availability across modalities

Intersection plot showing patient counts for each combination of available data modalities (N = 2,509).

  • 529 patients without expression data
  • significantly younger than the 1980 with expression data; Wilcoxon rank-sum p < 0.001
  • median age 58 vs 61.8 years

The 529 patients without expression data were significantly younger than the 1980 with expression data (median age 58 vs 61.8 years; Wilcoxon rank-sum p < 0.001).

NoteFigure 2 — Mutation-immune cell score associations

Median immune cell score differences (mutated minus wild-type) for genes with at least one significant association across 14 cell types. Significance by Wilcoxon rank-sum test (FDR < 0.05); only genes mutated in ≥10 patients tested.

NoteFigure 3 — TP53 mutation and immune infiltration after PAM50 adjustment

PAM50-adjusted effect sizes from linear regression (x-axis) and significance (y-axis) for each immune cell type. Dashed line marks FDR = 0.05.

Immune Infiltration Across PAM50 Subtypes

NoteFigure 4 — Immune cell score heatmap across PAM50 subtypes

Z-scored immune cell scores for 1,756 patients, split and hierarchically clustered by PAM50 subtype. Blue = below-mean; red = above-mean infiltration.

NoteFigure 5 — Immune cell scores by PAM50 subtype

Violin plots with box plots for cell types reaching Kruskal-Wallis FDR < 0.05. Lettering above each subtype plot indicate similarity/difference, subtypes sharing a letter do not differ significantly (pairwise Wilcoxon, FDR < 0.05).

Immune Scores and Overall Survival

Cell type HR Lower 95% CI Upper 95% CI p-value C-index FDR
Mast cells 0.885 0.831 0.942 <0.001 0.550 0.002
Macrophages 1.091 1.028 1.159 0.004 0.543 0.031
B cells 0.931 0.872 0.993 0.029 0.512 0.102
Treg 1.071 1.009 1.138 0.025 0.527 0.102
CD8 T cells 0.939 0.883 0.999 0.047 0.508 0.131
Cytotoxic cells 0.948 0.891 1.009 0.093 0.499 0.217
NK cells 1.051 0.988 1.119 0.115 0.510 0.230
Neutrophils 1.041 0.982 1.103 0.174 0.514 0.287
Th1 cells 1.042 0.980 1.108 0.185 0.515 0.287
CD4 T cells 1.025 0.968 1.086 0.395 0.512 0.554
Exhausted CD8 0.980 0.923 1.041 0.519 0.493 0.660
NK CD56dim 0.989 0.931 1.050 0.708 0.484 0.826
CD45 0.992 0.931 1.057 0.804 0.491 0.860
DC 1.006 0.945 1.070 0.860 0.518 0.860
NoteFigure 6 / Table 2 — Univariate Cox regression: immune scores and overall survival

Hazard ratios per SD increase from univariate Cox proportional hazards models for each immune cell type (FDR < 0.05).

Despite reaching FDR significance, the best-performing single cell type (Mast cells) achieved a concordance of only 0.55, indicating modest discriminative ability. Given the strong subtype dependence of immune infiltration observed in Q1, we next asked whether adjusting for PAM50 subtype alters the prognostic landscape.

PAM50-Adjusted Immune Prognostic Analysis

To account for subtype confounding, we fitted multivariable Cox models adjusting for PAM50 subtype for each of the 14 cell types. 4 cell types reached FDR < 0.05 after adjustment (B cells, Cytotoxic cells, CD8 T cells, Mast cells). For these, we additionally examined per-subtype effects.

Cell type HR Lower 95% CI Upper 95% CI p-value FDR C-index
B cells 0.889 0.830 0.952 <0.001 0.011 0.587
Cytotoxic cells 0.903 0.845 0.965 0.003 0.018 0.582
CD8 T cells 0.914 0.858 0.974 0.006 0.024 0.582
Mast cells 0.911 0.852 0.975 0.007 0.024 0.584
NK CD56dim 0.925 0.865 0.989 0.022 0.061 0.579
Treg 1.066 1.003 1.132 0.038 0.089 0.582
Exhausted CD8 0.942 0.883 1.005 0.069 0.139 0.578
NK cells 1.056 0.992 1.123 0.086 0.151 0.578
Neutrophils 1.046 0.987 1.108 0.132 0.205 0.581
Th1 cells 1.045 0.983 1.110 0.160 0.208 0.580
Macrophages 1.045 0.982 1.111 0.164 0.208 0.582
CD45 0.973 0.913 1.037 0.394 0.459 0.573
DC 0.978 0.917 1.043 0.499 0.537 0.575
CD4 T cells 1.006 0.949 1.066 0.852 0.852 0.577
NoteFigure 7 / Table 3 — PAM50-adjusted Cox regression and per-subtype effects

Cox proportional hazards models adjusted for PAM50 subtype. Forest plot shows the 4 cell types reaching FDR < 0.05; table shows all 14.

Interaction models (immune score × PAM50 subtype) revealed no significant heterogeneity for B cells (p = 0.047, FDR = 0.189), Cytotoxic cells (p = 0.250, FDR = 0.333), CD8 T cells (p = 0.443, FDR = 0.443), Mast cells (p = 0.102, FDR = 0.205), suggesting that while the magnitude of the prognostic effect varies by subtype, the direction is broadly consistent.

NoteFigure 8 — Kaplan-Meier survival curves for PAM50-adjusted significant cell types

Patients split at median score. P-values from log-rank test; titles show PAM50-adjusted HR and FDR.

Aggregating Immune Scores via Clustering

NoteFigure 9 — Immune cluster centroids

Mean z-scored immune cell type profiles for k-means clusters. Left: 2 clusters from all 14 cell types. Right: 2 clusters from the 4 PAM50-adjusted significant cell types only. Clusters labelled by overall immune level. Red = above-average infiltration; blue = below-average.

Model df Concordance AIC LR p vs PAM50 ΔAIC
PAM50 subtype only 4 0.576 14057.6 0.0
PAM50 + B cells 5 0.587 14047.5 <0.001 -10.1
PAM50 + Cytotoxic cells 5 0.582 14050.3 0.002 -7.3
PAM50 + CD8 T cells 5 0.582 14051.8 0.005 -5.8
PAM50 + Mast cells 5 0.584 14052.2 0.007 -5.4
PAM50 + Cluster (4 sig, k=2) 5 0.577 14056.6 0.086 -0.9
PAM50 + Cluster (all 14, k=2) 5 0.576 14058.5 0.314 1.0
NoteTable 4 — Prognostic comparison: PAM50 baseline vs immune-augmented models

LR test p-values against PAM50-only baseline. ΔAIC relative to baseline (negative = better fit).

Discussion

Conclusion

References

1.
Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: The next generation. Cell 144, 646–674 (2011).
2.
Gil Del Alcazar, C. R., Alečković, M. & Polyak, K. Immune escape during breast tumor progression. Cancer Immunology Research 8, 422–427 (2020).
3.
4.
5.
Stanton, S. E. & Disis, M. L. Clinical significance of tumor-infiltrating lymphocytes in breast cancer. Journal for ImmunoTherapy of Cancer 4, 59 (2016).
6.
Perou, C. M. et al. Molecular portraits of human breast tumours. Nature 406, 747–752 (2000).
7.
Parker, J. S. et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncology 27, 1160–1167 (2009).
8.
9.
10.
Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nature Methods 12, 453–457 (2015).
11.
Danaher, P. et al. Gene expression markers of tumor infiltrating leukocytes. Journal for ImmunoTherapy of Cancer 5, 18 (2017).
12.
13.
14.
15.
Fougner, C., Bergholtz, H., Norum, J. H. & Sørlie, T. Re-definition of claudin-low as a breast cancer phenotype. Nature Communications 11, 1787 (2020).

Addendum

Even if my day job is somewhat cancer adjacent, diving deep into this METABRIC dataset made me feel like a novice when it came to cancer / breast cancer. But it has been fun seeing what we learn in class come together in more ways than one. The lead authors for the immune cell scoring paper are from the company NanoString Technologies, the same company that has the commercial PAM50 test kit that we learned in class. And in the same paper, they cite the false discovery rate (FDR) correction paper from Benjamini and Hochberg!

This whole assignment has been a rollercoaster ride, and there were so many rabbit holes that this topic would lead you into. For example Danaher didn’t make it easy for me when they tweaked the TIL acronym, usually meaning tumour infiltrating lymphocytes, to sneak in leukocytes in the palce of lymphocytes, so that the immune cells of myeloid origin could be included in the paper. The inclusion of Claudin-low